Short-Long_RNAseq-Differentiation-Timecourse
To do:
- write discussion for gene set analysis
- write outlook
Differential Gene Expression and Gene Set Analyes
Methods
The differential gene expression and gene set analyses were performed in RStudio (Posit team (2023), version 2023.6.2.561, R version 4.3.1).
Differntial Gene Expression Analysis
After importing and processing of the data, as well as an exploratory data analysis, a differential gene expression analysis was conducted. Here the aim was twofold: first, the aim was to identify genes that are differentially expressed over the course of time, so over the course of the 5 days at which samples have been generated; second, the aim was to compare the results of the differential gene expression analysis of the Illumina+Salmon data (short read data) with the results of the PacBio+Bambu data (long read data). To do so, in total 8 different count data frames were analyzed. For each of the 4 NDR cutoffs (0.025, 0.05, 0.1, 0.2), there are two different count data frames, one derived from Illumina+Salmon and one derived from PacBio+Bambu. For the analysis of differentially expressed genes, a standard pipeline from the R package edgeR (Robinson et al, 2010, R package version 3.42.4) was used.
First, DGEList objects were created from the 8 count data frames and then filtered using the filterByExpr function, which determines genes that have large enough counts to be used for the differential expression analysis or other statistical analyses. The function removes genes that do not have a sufficient count in a worthwhile number of samples. Next, scaling factors for the library sizes were computed first using the function calcNormFactors to obtain normalized effective library sizes. Following, the function estimateDisp was applied on the DGEList objects to calculate posterior dispersion estimates. To do so, the following formula was used to create a design matrix: ~ day, where day specifies the day at which each sample was created, so days 0 to 5. Thus, the design matrix is created in a way that the first time point, day 0, is the baseline (intercept of the model). The next step is to fit a negative binomial generalized log-linear model to each gene. This was done using the function glmFit. Finally, a likelihood ratio test was performed using the function glmLRT. Here, the full model is compared to a reduced model which has values of zero for all coefficients but the baseline, so for days 1 to 5. For each gene, this computes p-values and other test statistics of the likelihood ratio test that were used afterwards to identify genes that show statistical significant differential expression. To find these genes, the function decideTests.DGELRT adjusted the p-values by the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) and then filtered differentially expressed genes by a significance threshold of 0.01.
To compare the results for Illumina+Salmon with the results for PacBio+Bambu, an UpSetR (Gehlenborg, 2019, R package version 1.4.0) plot was created for each NDR cutoff. This plot overlayed the genes called differentially expressed for the two different cases. Moreover, the p-values and logFC values (comparing day 0 and day 5) for each gene were compared using simple scatter plots.
Lastly, the differential expression pipeline described above was also applied to all data sets to compare the time points day 0 and day 5 discretely. To do so, the in the likelihood ratio test only the coefficient of day 5 in the reduced model was set to 0.
Grouping of Differentially Expressed Genes
After the differential gene expression analysis, the genes identified as differentially expressed were grouped into gene sets of similar genes. For this grouping, only the genes that were identified as differentially expressed by the analysis of both the Illumina+Salmon and the PacBio+Bambu data of the 0.2 NDR cutoff were considered. Moreover, the genes were grouped depending on their Illumina+Salmon counts since these counts were treated as a ground truth to which the PacBio+Bambu counts were compared.
Of these genes, the average expression at each time point was calculated first, so the average of the samples of each time point. Then, all gene-gene Spearman-correlations were computed and 1-correlation was used as the difference between two genes during hierarchical clustering. The results of the hierarchical clustering were then evaluated to group the genes into clusters. From visual inspection of a cluster dendrogram and a correlation heat map, the number of clusters was set manually. Lastly, the average expression of the genes of the clusters at each time point was calculated for each cluster.
The clusters were then used as gene sets for the following analysis of the functions or pathways that are associated with the genes identified as differentially expressed.
Gene Set Analysis
To make sense of the genes that were called differentially expressed, they were first grouped into gene sets of similar genes (see above). These gene sets were then used as input for a function and pathway analysis. Here, the frequency-based method DAVID (Sherman et al, 2022, https://david.ncifcrf.gov/home.jsp) was employed.
The Database for Annotation, Visualization and Integrated Discovery (DAVID) tool is a web tool for functional annotation and enrichment analyses of gene sets, e.g. to find enriched gene ontology terms. To do so, the web tool requires the user to upload not only a list of genes, so the gene set of interest, but also another list of genes, the background. The tool then searches for functional annotations that are statistically overrepresented (enriched) in the gene set of interest compared to the background. Therefore, the choice of background can influence the results of the analysis. Here, the gene sets of interest were the clusters specified during the grouping of the differentially expressed genes and the for the background, all genes after the filtering were used. The only exception to that are genes that were newly discovered during the transcript quantification since these genes do not have a valid ensemble gene id which is necessary for DAVID.
Results
Load packages
Load data
The data is loaded from RDS-objects that have been created with the import_processing.qmd file. For each of the 4 NDR cutoffs (0.025, 0.05, 0.1, 0.2), two data frames exist containing the expression data (counts) of the two different technologies used to obtain the data:
- Illumina sequencing and Salmon quantification (short read)
- PacBio sequencing and Bambu quantification (long read)
In total, this adds up to 8 different data frames. Each of these data frames contains 11 samples, which are replicates from one of five different time points. Each time point has 1-3 replicates.
bambu_0.025_gene <- df_list_bambu$bambu_0.025_gene
bambu_0.05_gene <- df_list_bambu$bambu_0.05_gene
bambu_0.1_gene <- df_list_bambu$bambu_0.1_gene
bambu_0.2_gene <- df_list_bambu$bambu_0.2_gene
salmon_0.025_gene <- df_list_salmon$salmon_0.025_gene
salmon_0.05_gene <- df_list_salmon$salmon_0.05_gene
salmon_0.1_gene <- df_list_salmon$salmon_0.1_gene
salmon_0.2_gene <- df_list_salmon$salmon_0.2_gene
(meta_samples <- df_list_meta$metadata) sampleID condition time
1 1 day0.rep1 0
2 2 day0.rep2 0
3 3 day0.rep3 0
4 4 day1.rep1 1
5 5 day2.rep1 2
6 6 day3.rep1 3
7 7 day3.rep2 3
8 8 day4.rep1 4
9 9 day5.rep1 5
10 10 day5.rep2 5
11 11 day5.rep3 5
Create DGEList objects and filter expressed genes
The expression data is then filtered using the function filterByExpr to exclude genes that do not show high enough expression or do not show expression in enough samples.
NDR 0.025
y_salmon_0.025 <- DGEList(counts = salmon_0.025_gene, group = day)
y_bambu_0.025 <- DGEList(counts = bambu_0.025_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.025)
expressed_bambu <- filterByExpr(y_bambu_0.025)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.025 <-
y_salmon_0.025[-which(rownames(salmon_0.025_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.025 <-
y_bambu_0.025[-which(rownames(bambu_0.025_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}NDR 0.05
y_salmon_0.05 <- DGEList(counts = salmon_0.05_gene, group = day)
y_bambu_0.05 <- DGEList(counts = bambu_0.05_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.05)
expressed_bambu <- filterByExpr(y_bambu_0.05)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.05 <-
y_salmon_0.05[-which(rownames(salmon_0.05_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.05 <-
y_bambu_0.05[-which(rownames(bambu_0.05_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}NDR 0.1
y_salmon_0.1 <- DGEList(counts = salmon_0.1_gene, group = day)
y_bambu_0.1 <- DGEList(counts = bambu_0.1_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.1)
expressed_bambu <- filterByExpr(y_bambu_0.1)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.1 <-
y_salmon_0.1[-which(rownames(salmon_0.1_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.1 <-
y_bambu_0.1[-which(rownames(bambu_0.1_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}NDR 0.2
y_salmon_0.2 <- DGEList(counts = salmon_0.2_gene, group = day)
y_bambu_0.2 <- DGEList(counts = bambu_0.2_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.2)
expressed_bambu <- filterByExpr(y_bambu_0.2)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.2 <-
y_salmon_0.2[-which(rownames(salmon_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.2 <-
y_bambu_0.2[-which(rownames(bambu_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}Differential Gene Expression Analysis
The differential gene expression analysis aims to identify genes that are differentially expressed over the time course. These results are computed for all 4 NDR cutoffs of both technologies, and are then overlayed for each NDR cutoff to compare the two technologies, so the long read and short read sequencing.
The design matrix is created in a way that the first time point, day0, is used as a baseline (intercept of the model).
(Intercept) day1 day2 day3 day4 day5
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 0 0 0 0 0
4 1 1 0 0 0 0
5 1 0 1 0 0 0
6 1 0 0 1 0 0
7 1 0 0 1 0 0
8 1 0 0 0 1 0
9 1 0 0 0 0 1
10 1 0 0 0 0 1
11 1 0 0 0 0 1
attr(,"assign")
[1] 0 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$day
[1] "contr.treatment"
NDR 0.025
The first NDR cutoff is 0.025. The differential expression analysis is performed with edgeR, and genes are considered as differentially expressed if the adjusted p-value of the likelihood ratio test between the full and the reduced model is below 0.01. The p-value adjustment is performed using the Benjamini-Hochberg method.
# Salmon 0.025
y_salmon_0.025 <- calcNormFactors(y_salmon_0.025)
## estimate dispersion
y_salmon_0.025 <- estimateDisp(y_salmon_0.025, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.025, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_s0.025 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.025) day5-day4-day3-day2-day1
NotSig 17374
Sig 7343
# Bambu 0.025
y_bambu_0.025 <- calcNormFactors(y_bambu_0.025)
## estimate dispersion
y_bambu_0.025 <- estimateDisp(y_bambu_0.025, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.025, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.025 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.025) day5-day4-day3-day2-day1
NotSig 16879
Sig 7838
For Illumina+Salmon, out of the 24717 tested genes, 7343 were identified as differentially expressed. In the case of PacBio+Bambu, 7838 genes were identified as differentially expressed. Both methods identify roughly the same amount of genes as differentially expressed.
# overlay results of both technologies
de_genes_0.025 <- as.data.frame(cbind(decide_dif_s0.025, decide_dif_b0.025))
colnames(de_genes_0.025) <- c("Salmon_0.025", "Bambu_0.025")
upset(de_genes_0.025, sets = colnames(de_genes_0.025))From the UpSet plot above it is visible that, even though identifying roughly the same number of genes as differentially expressed, the two technologies show some differences in their results. The majority of genes identified overlap, however, there are still some genes only called differentially expressed by either Illumina+Salmon or by PacBio+Bambu. To further investigate the similarities/differences, the logFC and p-values of the analysis was compared. The p-values are calculated for the likelihood ratio test, the logFC values, however, are calculated for the discrete comparisons of the days 1-5 with day 0. Here, the logFC values of day 5 were compared.
# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.025)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.025)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The comparison of the p-values after adjustment shows a similar trend in both technologies, but also some differences. The same is observable for the logFC values. Additionally, it is visible that many genes that show a logFC value of zero in the analysis of one technology shows a (high) logFC value in the other one, which agrees with the observation made in the UpSet plot, that some genes are called differentially expressed by either of the two technologies, but not by both.
After performing the differential gene expression analysis over the whole time course, the first and the last time points (days 0 and 5) were also compared as discrete conditions to identify up- or down-regulated genes only comparing the begin and the end of the experiment.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.025, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.025_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.025_tp) day5
Down 2377
NotSig 18283
Up 4057
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.025_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.025 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.025, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.025_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.025_tp) day5
Down 2014
NotSig 19559
Up 3144
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.025_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.025 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))The analysis identifies 2377 down-regulated and 4057 up-regulated genes for Illumian+Salmon and 2014 down-regulated and 3144 up-regulated genes for PacBio+Bambu. Again, 24717 genes were tested and both technologies show similar results.
NDR 0.05
The next NDR cutoff is 0.05. The same analysis steps that were performed for the NDR cutoff of 0.025 were performed here.
# Salmon 0.05
y_salmon_0.05 <- calcNormFactors(y_salmon_0.05)
## estimate dispersion
y_salmon_0.05 <- estimateDisp(y_salmon_0.05, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.05, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_s0.05 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.05) day5-day4-day3-day2-day1
NotSig 17444
Sig 7367
# Bambu 0.05
y_bambu_0.05 <- calcNormFactors(y_bambu_0.05)
## estimate dispersion
y_bambu_0.05 <- estimateDisp(y_bambu_0.05, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.05, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.05 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.05) day5-day4-day3-day2-day1
NotSig 16927
Sig 7884
For Illumina+Salmon, out of the 24811 tested genes, 7367 were identified as differentially expressed. In the case of PacBio+Bambu, 7884 genes were identified as differentially expressed. Both methods identify roughly the same amount of genes as differentially expressed.
# overlay results of both technologies
de_genes_0.05 <- as.data.frame(cbind(decide_dif_s0.05, decide_dif_b0.05))
colnames(de_genes_0.05) <- c("Salmon_0.05", "Bambu_0.05")
upset(de_genes_0.05, sets = colnames(de_genes_0.05))The UpSet plot shows similar results compared to the NDR cutoff of 0.025. Again, the majority of genes are detected by both methods, however, some are detected by either of the two methods.
# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.05)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.05)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The comparison of adjusted p-values and logFC values shows the same results as seen when performing the comparison on the results of the 0.025 NDR cutoff.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.05, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.05_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.05_tp) day5
Down 2388
NotSig 18352
Up 4071
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.05_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.05 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.05, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.05_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.05_tp) day5
Down 2031
NotSig 19608
Up 3172
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.05_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.05 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))The differential expression analysis of day 5 and day 0 identifies 2388 down-regulated and 4071 up-regulated genes for Illumian+Salmon and 2031 down-regulated and 3172 up-regulated genes for PacBio+Bambu. Again, 24811 genes were tested and both technologies show similar results.
NDR 0.1
The next NDR cutoff is 0.1. The same analysis steps that were performed for the other NDR cutoffs were performed here as well.
# Salmon 0.1
y_salmon_0.1 <- calcNormFactors(y_salmon_0.1)
## estimate dispersion
y_salmon_0.1 <- estimateDisp(y_salmon_0.1, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.1, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_s0.1 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.1) day5-day4-day3-day2-day1
NotSig 17575
Sig 7529
# Bambu 0.1
y_bambu_0.1 <- calcNormFactors(y_bambu_0.1)
## estimate dispersion
y_bambu_0.1 <- estimateDisp(y_bambu_0.1, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.1, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.1 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.1) day5-day4-day3-day2-day1
NotSig 17056
Sig 8048
The differential expression analysis of the whole time course identified 7529 genes as differentially expressed for Illumina+Salmon and 8048 for PacBio+Bambu. 25104 genes were tested.
# overlay results of both technologies
de_genes_0.1 <- as.data.frame(cbind(decide_dif_s0.1, decide_dif_b0.1))
colnames(de_genes_0.1) <- c("Salmon_0.1", "Bambu_0.1")
upset(de_genes_0.1, sets = colnames(de_genes_0.1))When overlaying the results of the two technologies, it is again visible that most genes are detected by both methods, while some are deteced exclusively by one of them.
# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.1)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.1)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The comparison of adjusted p-values and logFC values shows the same trends as seen for the other NDR cutoffs.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.1, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.1_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.1_tp) day5
Down 2467
NotSig 18544
Up 4093
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.1_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.1 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.1, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.1_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.1_tp) day5
Down 2093
NotSig 19746
Up 3265
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.1_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.1 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))The time point differential expression analysis (day5 and day0) identified 2467 down-regulated and 4093 up-regulated genes for Illumina+Salmon and 2093 down-regulated and 3265 up-regulated genes for PacBio+Bambu. In total, 25104 were tested.
NDR 0.2
The last NDR cutoff is 0.2. Again, the same steps are performed.
# Salmon 0.2
y_salmon_0.2 <- calcNormFactors(y_salmon_0.2)
## estimate dispersion
y_salmon_0.2 <- estimateDisp(y_salmon_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
## save lrt for pathway analysis
lrt_pathway <- lrtS
#topTags(lrt)
## multiple testing correction
decide_dif_s0.2 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.2) day5-day4-day3-day2-day1
NotSig 18289
Sig 7723
# Bambu 0.2
y_bambu_0.2 <- calcNormFactors(y_bambu_0.2)
## estimate dispersion
y_bambu_0.2 <- estimateDisp(y_bambu_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.2 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.2) day5-day4-day3-day2-day1
NotSig 17667
Sig 8345
The whole time course differential expression analysis identified 7723 deferentially expressed genes for Illumina+Salmon and 8345 for PacBio+Bambu. For the NDR cutoff of 0.2, 26012 genes were tested.
# overlay results of both technologies
de_genes_0.2 <- as.data.frame(cbind(decide_dif_s0.2, decide_dif_b0.2))
colnames(de_genes_0.2) <- c("Salmon_0.2", "Bambu_0.2")
upset(de_genes_0.2, sets = colnames(de_genes_0.2))# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.2)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.2)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The overlay of the results and the logFC/p-value comparsion showed similar results compared to the other NDR cutoffs.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.2, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.2_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.2_tp) day5
Down 2556
NotSig 19230
Up 4226
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.2_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.2 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.2, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.2_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.2_tp) day5
Down 2193
NotSig 20389
Up 3430
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.2_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.2 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))For the time point comparison, 2467 down-regulated and 4093 up-regulated genes for Illumina+Salmon and 2093 down-regulated and 3265 up-regulated genes for PacBio+Bambu were identified. In total, 26012 were tested.
The differential gene expression analysis mainly showed two things. First, it showed that the results of the different NDR cutoff rates are very similar, the main difference here is that more genes are in the data in the first place from the quantification. Second, it showed that the results of the two different technologies, Illumina+Salmon and PacBio+Bambu, are quite similar, but also show some differences. When treating the Illumina+Salmon data as the ground truth, it can be said that PacBio+Bambu is pretty close to the truth, but still shows some deviations.
Following a differential gene expression analysis, the identified genes can be grouped into gene sets of similar genes and a function or pathways analysis can be conducted using these gene sets. To do so, only the genes that were detected by the analyses of both technologies were used, so the overlap which can be seen in the UpSet plot. Since the results of the different NDR cutoffs are very similar, only one of the cutoffs, namely the NDR cutoff of 0.2, was used. This is the highest cutoff and therefore the data includes the most transcripts/genes.
Grouping of Differentially Expressed Genes
To group the differentially expressed genes, the average expression of each gene at each time point was calculated first.
# create a list of the genes called DE by both methods
s <- which(de_genes_0.2$Salmon_0.2 == 1)
b <- which(de_genes_0.2$Bambu_0.2 == 1)
s_b <- intersect(s, b)
DEG <- rownames(de_genes_0.2[s_b,])
# calculate the average expression of each gene at each time point (average of the replicates)
avg_expr <- as.data.frame(sapply(sort(unique(meta_samples$time))[c(1,4,6)], function(time) {
rowMeans(salmon_0.2_gene[, which(meta_samples$time == time)])
}))
# for time points 1, 2 and 4, only one replicate exists
avg_expr$V4 <- unlist(as.vector(salmon_0.2_gene[, which(meta_samples$time == 1)]))
avg_expr$V5 <- unlist(as.vector(salmon_0.2_gene[, which(meta_samples$time == 2)]))
avg_expr$V6 <- unlist(as.vector(salmon_0.2_gene[, which(meta_samples$time == 4)]))
avg_expr <- avg_expr[, c(1,4,5,2,6,3)]
colnames(avg_expr) <- sort(unique(meta_samples$time))The genes are then clustered using the their correlation as a distance measure (to be more precise: 1-correlation). The cluster dendrogram of the hierarchical clustering as well as a heat map of the correlation including the dendrogram are shown below.
corr_DEG <- cor(t(avg_expr[DEG,]), method = "spearman")
hcl_DEG <- hclust(as.dist(1 - corr_DEG), method = "complete")
plot(hcl_DEG, label = FALSE)cl_DEG <- cutree(hcl_DEG, k = 4)
heatmap.2(corr_DEG, Rowv = as.dendrogram(hcl_DEG), Colv = as.dendrogram(hcl_DEG),
trace = "none", scale = "none", labRow = NA, labCol = NA, col = viridis,
ColSideColors = rainbow(15)[cl_DEG])The genes were then split into 4 clusters by visual inspection, which can be seen in the plot above. The bar indicates the 4 clusters by different colors (yellow, red, green and orange). The average expression of the genes of the 4 clusters at each time point are shown below:
avg_expr_DEG_list <- tapply(names(cl_DEG), cl_DEG, function(x) avg_expr[x,])
scaled_expr_DEG_list <- lapply(avg_expr_DEG_list, function(x) t(scale(t(x))))
layout(matrix(1:4, nrow = 2, byrow = T))
for(cl in 1:4)
boxplot(scaled_expr_DEG_list[[cl]],
main = paste0(cl, " (", nrow(scaled_expr_DEG_list[[cl]]), ")"))The expression pattern of cluster 1 seems to follow a downwards trend, meaning that cluster 1 might contain mostly genes that are down-regulated over the time course. Clusters 2 and 3 show an upwards trend in expression, so these clusters might contain mostly genes that are up-regulated over the time course. Cluster 4 does not follow an apparent trend, there are many outliers visible in at all time points and the hight of the boxes seem to fluctuate rather than following a trend. Moreover, this cluster is also the smallest one.
For the following gene set analysis, the 4 clusters were used as gene sets.
Gene Set Analysis
The gene set analysis was performed using two different methods. First, the method DAVID was used, which is a frequency-based method. Second, the method GSEA was used, which is a ranked-based method.
DAVID (frequency-based)
To perform DAVID on the 4 gene sets, the ensemble gene ids of the genes must be extracted and then written into text files, which are uploaded to the DAVID web tool. Since same of the newly discovered transcripts could not be mapped to an existing gene, there are some genes in the gene sets that do not have an ensemble gene id compatible with DAVID. These genes were therefore removed from the analysis. As explained in the methods part, the results of DAVID depend on the background gene set which is used during the analysis. Here, the background list contained all genes of the 0.2 NDR cutoff data after filtering, with the exception of newly identified genes.
get_ensemble_id <- function(input_string) {
ensemble_id_version <- unlist(strsplit(input_string, "\\."))
return(ensemble_id_version[1])
}
input <- names(which(cl_DEG == 1))
names_cluster1 <- unname(sapply(input, get_ensemble_id))
input <- names(which(cl_DEG == 2))
names_cluster2 <- unname(sapply(input, get_ensemble_id))
input <- names(which(cl_DEG == 3))
names_cluster3 <- unname(sapply(input, get_ensemble_id))
input <- names(which(cl_DEG == 4))
names_cluster4 <- unname(sapply(input, get_ensemble_id))
input <- rownames(y_salmon_0.2)
background <- unname(sapply(input, get_ensemble_id))
write.table(
names_cluster1[-which(startsWith(names_cluster1, "Bambu"))],
file = "../genes_C1.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
names_cluster2,
file = "../genes_C2.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
names_cluster3,
file = "../genes_C3.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
names_cluster4,
file = "../genes_C4.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
background[-which(startsWith(background, "Bambu"))],
file = "../genes_expressed.txt",
quote = F,
row.names = F,
col.names = F
)The results were downloaded from the DAVID web tool as text files and contain terms describing the functions or pathways different genes are involved in. The results also contain other measures, like the count of genes of the input gene set that are associated with a specific term, as well as adjusted p-values showing the significance of this term in the input gene set. The results were then filtered by the adjusted p-values. The adjustment was conducted using the Benjamini-Hochberg method. Following, the results were sorted by their adjusted p-values in ascending order.
# import DAVID results
cluster1 <- read.csv("../DAVID/C1_expressed_background.txt", sep = "\t")
cluster2 <- read.csv("../DAVID/C2_expressed_background.txt", sep = "\t")
cluster3 <- read.csv("../DAVID/C3_expressed_background.txt", sep = "\t")
cluster4 <- read.csv("../DAVID/C4_expressed_background.txt", sep = "\t")
# get significant terms
cluster1_sig <- cluster1 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster1_terms <- cluster1_sig$Term) [1] "KW-0770~Synapse"
[2] "KW-0628~Postsynaptic cell membrane"
[3] "KW-1003~Cell membrane"
[4] "hsa04727:GABAergic synapse"
[5] "hsa04723:Retrograde endocannabinoid signaling"
[6] "KW-0325~Glycoprotein"
[7] "IPR018000:Neurotransmitter-gated ion-channel, conserved site"
[8] "IPR006029:Neurotransmitter-gated ion-channel transmembrane domain"
[9] "IPR006202:Neurotransmitter-gated ion-channel ligand-binding"
[10] "IPR006201:Neurotransmitter-gated ion-channel"
[11] "GO:0030594~neurotransmitter receptor activity"
[12] "hsa05033:Nicotine addiction"
[13] "hsa05032:Morphine addiction"
[14] "GO:0045202~synapse"
[15] "KW-1015~Disulfide bond"
[16] "GO:0050877~neurological system process"
[17] "GO:0007268~chemical synaptic transmission"
[18] "KW-1071~Ligand-gated ion channel"
[19] "hsa04724:Glutamatergic synapse"
[20] "GO:0045211~postsynaptic membrane"
[21] "KW-0732~Signal"
[22] "DOMAIN:Neurotransmitter-gated ion-channel transmembrane"
[23] "CARBOHYD:N-linked (GlcNAc...) asparagine"
[24] "GO:0005230~extracellular ligand-gated ion channel activity"
[25] "DOMAIN:Neurotransmitter-gated ion-channel ligand-binding"
[26] "KW-0472~Membrane"
[27] "GO:0099060~integral component of postsynaptic specialization membrane"
[28] "GO:0005886~plasma membrane"
[29] "hsa04725:Cholinergic synapse"
[30] "KW-0429~Leber hereditary optic neuropathy"
cluster2_sig <- cluster2 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster2_terms <- cluster2_sig$Term) [1] "KW-0325~Glycoprotein"
[2] "KW-1015~Disulfide bond"
[3] "CARBOHYD:N-linked (GlcNAc...) asparagine"
[4] "KW-0732~Signal"
[5] "KW-0964~Secreted"
[6] "GO:0005615~extracellular space"
[7] "KW-0217~Developmental protein"
[8] "KW-9996~Developmental protein"
[9] "GO:0005886~plasma membrane"
[10] "GO:0005576~extracellular region"
[11] "KW-0371~Homeobox"
[12] "IPR020479:Homeodomain, metazoa"
[13] "IPR001356:Homeodomain"
[14] "IPR017970:Homeobox, conserved site"
[15] "SM00389:HOX"
[16] "TOPO_DOM:Extracellular"
[17] "DNA_BIND:Homeobox"
[18] "KW-0272~Extracellular matrix"
[19] "GO:0005887~integral component of plasma membrane"
[20] "IPR009057:Homeodomain-like"
[21] "GO:0001525~angiogenesis"
[22] "GO:1990837~sequence-specific double-stranded DNA binding"
[23] "KW-0106~Calcium"
[24] "GO:0009952~anterior/posterior pattern specification"
[25] "GO:0031012~extracellular matrix"
[26] "KW-0130~Cell adhesion"
[27] "KW-0675~Receptor"
[28] "KW-1003~Cell membrane"
[29] "GO:0009986~cell surface"
[30] "IPR001827:Homeobox protein, antennapedia type, conserved site"
[31] "GO:0005201~extracellular matrix structural constituent"
[32] "TOPO_DOM:Cytoplasmic"
[33] "GO:0005788~endoplasmic reticulum lumen"
[34] "GO:0009897~external side of plasma membrane"
[35] "GO:0007155~cell adhesion"
[36] "GO:0001501~skeletal system development"
[37] "DOMAIN:Homeobox"
[38] "GO:0007165~signal transduction"
[39] "GO:0000785~chromatin"
[40] "IPR013783:Immunoglobulin-like fold"
[41] "GO:0048704~embryonic skeletal system morphogenesis"
[42] "MOTIF:Antp-type hexapeptide"
[43] "IPR000742:Epidermal growth factor-like domain"
[44] "KW-0037~Angiogenesis"
[45] "GO:0001570~vasculogenesis"
[46] "IPR001881:EGF-like calcium-binding"
[47] "KW-0245~EGF-like domain"
[48] "IPR009030:Insulin-like growth factor binding protein, N-terminal"
[49] "hsa04611:Platelet activation"
[50] "KW-0393~Immunoglobulin domain"
[51] "SM00181:EGF"
[52] "IPR003599:Immunoglobulin subtype"
[53] "DOMAIN:EGF-like 1"
[54] "GO:0043235~receptor complex"
[55] "IPR017995:Homeobox protein, antennapedia type"
[56] "GO:0045766~positive regulation of angiogenesis"
[57] "GO:0030198~extracellular matrix organization"
[58] "GO:0010628~positive regulation of gene expression"
[59] "GO:0005509~calcium ion binding"
[60] "SM00179:EGF_CA"
[61] "IPR007110:Immunoglobulin-like domain"
[62] "KW-0165~Cleavage on pair of basic residues"
[63] "GO:0003180~aortic valve morphogenesis"
[64] "IPR018097:EGF-like calcium-binding, conserved site"
[65] "DOMAIN:EGF-like"
[66] "GO:0001228~transcriptional activator activity, RNA polymerase II transcription regulatory region sequence-specific binding"
[67] "GO:0045944~positive regulation of transcription from RNA polymerase II promoter"
[68] "IPR000152:EGF-type aspartate/asparagine hydroxylation site"
[69] "hsa04610:Complement and coagulation cascades"
[70] "GO:0005925~focal adhesion"
[71] "GO:0007267~cell-cell signaling"
[72] "GO:0007179~transforming growth factor beta receptor signaling pathway"
[73] "GO:0014068~positive regulation of phosphatidylinositol 3-kinase signaling"
[74] "GO:0030335~positive regulation of cell migration"
[75] "SM00409:IG"
[76] "GO:0006954~inflammatory response"
[77] "DOMAIN:Ig-like C2-type 2"
[78] "DOMAIN:Ig-like C2-type 1"
[79] "DOMAIN:EGF-like 2"
[80] "GO:0005518~collagen binding"
[81] "MOTIF:Cell attachment site"
[82] "GO:0001958~endochondral ossification"
[83] "GO:0030199~collagen fibril organization"
[84] "DOMAIN:EGF-like 6"
[85] "GO:0001649~osteoblast differentiation"
[86] "hsa04151:PI3K-Akt signaling pathway"
[87] "hsa04510:Focal adhesion"
[88] "GO:0008201~heparin binding"
[89] "GO:0001568~blood vessel development"
[90] "DOMAIN:EGF-like 3"
[91] "KW-0358~Heparin-binding"
[92] "KW-0176~Collagen"
[93] "GO:0050840~extracellular matrix binding"
[94] "GO:0051216~cartilage development"
[95] "hsa04512:ECM-receptor interaction"
[96] "GO:0007186~G-protein coupled receptor signaling pathway"
[97] "GO:0060065~uterus development"
[98] "DOMAIN:Ig-like"
[99] "GO:0008284~positive regulation of cell proliferation"
[100] "KW-0765~Sulfation"
[101] "KW-0221~Differentiation"
[102] "DOMAIN:EGF-like 5"
[103] "DOMAIN:EGF-like 4"
[104] "KW-0430~Lectin"
[105] "GO:0034446~substrate adhesion-dependent cell spreading"
[106] "KW-0472~Membrane"
[107] "GO:0032967~positive regulation of collagen biosynthetic process"
[108] "GO:0003148~outflow tract septum morphogenesis"
[109] "hsa05150:Staphylococcus aureus infection"
[110] "hsa04060:Cytokine-cytokine receptor interaction"
[111] "IPR003598:Immunoglobulin subtype 2"
[112] "GO:0051897~positive regulation of protein kinase B signaling"
[113] "GO:0005178~integrin binding"
[114] "GO:0030509~BMP signaling pathway"
[115] "GO:0048844~artery morphogenesis"
[116] "GO:0030154~cell differentiation"
[117] "GO:2000352~negative regulation of endothelial cell apoptotic process"
[118] "GO:0035115~embryonic forelimb morphogenesis"
[119] "GO:0005581~collagen trimer"
[120] "GO:0045121~membrane raft"
[121] "GO:0019955~cytokine binding"
[122] "GO:0050679~positive regulation of epithelial cell proliferation"
[123] "hsa04015:Rap1 signaling pathway"
[124] "GO:0055010~ventricular cardiac muscle tissue morphogenesis"
[125] "GO:0060038~cardiac muscle cell proliferation"
[126] "GO:0016477~cell migration"
[127] "GO:0060021~palate development"
[128] "GO:0003700~transcription factor activity, sequence-specific DNA binding"
[129] "GO:0048407~platelet-derived growth factor binding"
[130] "GO:0002062~chondrocyte differentiation"
[131] "GO:0007596~blood coagulation"
[132] "GO:0035924~cellular response to vascular endothelial growth factor stimulus"
[133] "GO:0070374~positive regulation of ERK1 and ERK2 cascade"
[134] "GO:0006955~immune response"
[135] "GO:0043627~response to estrogen"
[136] "GO:0007507~heart development"
[137] "GO:0009954~proximal/distal pattern formation"
[138] "GO:0008285~negative regulation of cell proliferation"
[139] "GO:0007160~cell-matrix adhesion"
[140] "KW-0654~Proteoglycan"
[141] "IPR000372:Leucine-rich repeat-containing N-terminal"
[142] "GO:0060395~SMAD protein signal transduction"
[143] "GO:0016525~negative regulation of angiogenesis"
[144] "GO:0035025~positive regulation of Rho protein signal transduction"
[145] "GO:0051145~smooth muscle cell differentiation"
[146] "GO:0038023~signaling receptor activity"
[147] "GO:0003007~heart morphogenesis"
[148] "GO:0014911~positive regulation of smooth muscle cell migration"
[149] "hsa05144:Malaria"
[150] "GO:0010718~positive regulation of epithelial to mesenchymal transition"
[151] "GO:0001837~epithelial to mesenchymal transition"
[152] "GO:0001503~ossification"
[153] "GO:0008217~regulation of blood pressure"
[154] "GO:0031100~animal organ regeneration"
[155] "GO:0002053~positive regulation of mesenchymal cell proliferation"
[156] "GO:0010463~mesenchymal cell proliferation"
[157] "IPR008160:Collagen triple helix repeat"
[158] "hsa05152:Tuberculosis"
[159] "GO:0001938~positive regulation of endothelial cell proliferation"
[160] "GO:0000976~transcription regulatory region sequence-specific DNA binding"
[161] "GO:0005796~Golgi lumen"
[162] "GO:0009611~response to wounding"
[163] "GO:0050729~positive regulation of inflammatory response"
[164] "GO:0001656~metanephros development"
[165] "SM00013:LRRNT"
[166] "GO:0010629~negative regulation of gene expression"
[167] "GO:0015026~coreceptor activity"
[168] "SM00408:IGc2"
[169] "TRANSMEM:Helical"
[170] "GO:0003151~outflow tract morphogenesis"
[171] "IPR016186:C-type lectin-like"
[172] "hsa05200:Pathways in cancer"
[173] "hsa04380:Osteoclast differentiation"
[174] "GO:0005102~receptor binding"
[175] "GO:0007275~multicellular organism development"
[176] "GO:0016021~integral component of membrane"
[177] "GO:0090050~positive regulation of cell migration involved in sprouting angiogenesis"
[178] "GO:0030246~carbohydrate binding"
[179] "GO:0000981~RNA polymerase II transcription factor activity, sequence-specific DNA binding"
[180] "GO:0001822~kidney development"
[181] "GO:0072562~blood microparticle"
[182] "GO:0031093~platelet alpha granule lumen"
[183] "GO:0071773~cellular response to BMP stimulus"
[184] "GO:0042475~odontogenesis of dentin-containing tooth"
[185] "GO:0045599~negative regulation of fat cell differentiation"
[186] "GO:0045165~cell fate commitment"
[187] "GO:0048146~positive regulation of fibroblast proliferation"
[188] "GO:0000978~RNA polymerase II core promoter proximal region sequence-specific DNA binding"
[189] "GO:0004888~transmembrane signaling receptor activity"
[190] "GO:0060412~ventricular septum morphogenesis"
[191] "KW-0297~G-protein coupled receptor"
[192] "KW-0807~Transducer"
[193] "GO:0007229~integrin-mediated signaling pathway"
[194] "hsa04010:MAPK signaling pathway"
[195] "hsa05142:Chagas disease"
[196] "KW-0356~Hemostasis"
[197] "KW-0094~Blood coagulation"
[198] "KW-0892~Osteogenesis"
[199] "KW-0965~Cell junction"
[200] "GO:0048706~embryonic skeletal system development"
[201] "GO:0001666~response to hypoxia"
[202] "GO:0045746~negative regulation of Notch signaling pathway"
[203] "DOMAIN:EGF-like 2; calcium-binding"
[204] "hsa05202:Transcriptional misregulation in cancer"
[205] "KW-0395~Inflammatory response"
[206] "GO:0042127~regulation of cell proliferation"
[207] "GO:0050728~negative regulation of inflammatory response"
[208] "GO:0042060~wound healing"
[209] "IPR017452:GPCR, rhodopsin-like, 7TM"
[210] "KW-0391~Immunity"
[211] "KW-0873~Pyrrolidone carboxylic acid"
[212] "hsa04020:Calcium signaling pathway"
[213] "hsa04514:Cell adhesion molecules"
[214] "KW-0336~GPI-anchor"
[215] "hsa04350:TGF-beta signaling pathway"
[216] "GO:0061036~positive regulation of cartilage development"
[217] "hsa05032:Morphine addiction"
[218] "GO:0043410~positive regulation of MAPK cascade"
[219] "IPR000276:G protein-coupled receptor, rhodopsin-like"
[220] "IPR003961:Fibronectin, type III"
[221] "GO:0060045~positive regulation of cardiac muscle cell proliferation"
[222] "GO:0007169~transmembrane receptor protein tyrosine kinase signaling pathway"
[223] "GO:0030097~hemopoiesis"
[224] "GO:0006935~chemotaxis"
[225] "GO:0005539~glycosaminoglycan binding"
[226] "IPR016187:C-type lectin fold"
[227] "GO:0048010~vascular endothelial growth factor receptor signaling pathway"
[228] "GO:0035987~endodermal cell differentiation"
[229] "GO:0048008~platelet-derived growth factor receptor signaling pathway"
[230] "GO:0048286~lung alveolus development"
[231] "DOMAIN:Fibronectin type-III"
[232] "GO:0035579~specific granule membrane"
[233] "GO:0060216~definitive hemopoiesis"
[234] "GO:0040037~negative regulation of fibroblast growth factor receptor signaling pathway"
[235] "CARBOHYD:N-linked (GlcNAc...) (complex) asparagine"
[236] "GO:1905370~serine-type endopeptidase complex"
[237] "hsa04979:Cholesterol metabolism"
[238] "GO:0071560~cellular response to transforming growth factor beta stimulus"
[239] "GO:0030168~platelet activation"
[240] "GO:0010595~positive regulation of endothelial cell migration"
[241] "IPR013106:Immunoglobulin V-set"
[242] "IPR013151:Immunoglobulin"
[243] "GO:0001934~positive regulation of protein phosphorylation"
[244] "GO:0032731~positive regulation of interleukin-1 beta production"
[245] "IPR000047:Helix-turn-helix motif"
[246] "IPR001304:C-type lectin"
[247] "GO:0033089~positive regulation of T cell differentiation in thymus"
[248] "GO:0070062~extracellular exosome"
[249] "hsa05205:Proteoglycans in cancer"
[250] "hsa04974:Protein digestion and absorption"
cluster3_sig <- cluster3 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster3_terms <- cluster3_sig$Term) [1] "GO:0005886~plasma membrane"
[2] "KW-0325~Glycoprotein"
[3] "KW-1015~Disulfide bond"
[4] "CARBOHYD:N-linked (GlcNAc...) asparagine"
[5] "KW-0472~Membrane"
[6] "KW-1003~Cell membrane"
[7] "KW-0732~Signal"
[8] "TOPO_DOM:Cytoplasmic"
[9] "KW-0130~Cell adhesion"
[10] "KW-0964~Secreted"
[11] "TOPO_DOM:Extracellular"
[12] "GO:0009986~cell surface"
[13] "KW-0037~Angiogenesis"
[14] "GO:0016021~integral component of membrane"
[15] "GO:0070062~extracellular exosome"
[16] "GO:0005615~extracellular space"
[17] "GO:0005576~extracellular region"
[18] "KW-0106~Calcium"
[19] "GO:0007165~signal transduction"
[20] "GO:0007155~cell adhesion"
[21] "TRANSMEM:Helical"
[22] "GO:0005788~endoplasmic reticulum lumen"
[23] "KW-1133~Transmembrane helix"
[24] "KW-0812~Transmembrane"
[25] "GO:0009897~external side of plasma membrane"
[26] "IPR011993:Pleckstrin homology-like domain"
[27] "GO:0016020~membrane"
[28] "KW-0768~Sushi"
[29] "hsa04510:Focal adhesion"
[30] "KW-0735~Signal-anchor"
[31] "KW-0344~Guanine-nucleotide releasing factor"
[32] "KW-0245~EGF-like domain"
[33] "KW-0084~Basement membrane"
[34] "GO:0007160~cell-matrix adhesion"
[35] "hsa05200:Pathways in cancer"
[36] "hsa04015:Rap1 signaling pathway"
[37] "hsa04014:Ras signaling pathway"
[38] "hsa04640:Hematopoietic cell lineage"
[39] "GO:0005856~cytoskeleton"
[40] "KW-0034~Amyloid"
[41] "GO:0007264~small GTPase mediated signal transduction"
[42] "GO:0001525~angiogenesis"
[43] "hsa04512:ECM-receptor interaction"
[44] "KW-0272~Extracellular matrix"
[45] "KW-0391~Immunity"
[46] "GO:0005794~Golgi apparatus"
[47] "hsa04148:Efferocytosis"
[48] "GO:0005925~focal adhesion"
cluster4_sig <- cluster4 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster4_terms <- cluster4_sig$Term) [1] "GO:0005730~nucleolus"
[2] "GO:0003723~RNA binding"
[3] "GO:0005654~nucleoplasm"
[4] "KW-0539~Nucleus"
[5] "GO:0005694~chromosome"
[6] "GO:0006364~rRNA processing"
[7] "KW-0597~Phosphoprotein"
[8] "KW-0007~Acetylation"
[9] "KW-0698~rRNA processing"
[10] "KW-0690~Ribosome biogenesis"
[11] "KW-0235~DNA replication"
[12] "GO:0032040~small-subunit processome"
[13] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO2)"
[14] "KW-0067~ATP-binding"
[15] "GO:0006260~DNA replication"
[16] "KW-0158~Chromosome"
[17] "KW-0809~Transit peptide"
[18] "KW-0694~RNA-binding"
[19] "KW-0832~Ubl conjugation"
[20] "KW-1017~Isopeptide bond"
[21] "hsa04110:Cell cycle"
[22] "KW-0347~Helicase"
[23] "KW-0496~Mitochondrion"
[24] "GO:0005524~ATP binding"
[25] "GO:0016887~ATPase activity"
[26] "hsa03030:DNA replication"
[27] "KW-0853~WD repeat"
[28] "GO:0005634~nucleus"
[29] "GO:0042274~ribosomal small subunit biogenesis"
[30] "KW-0547~Nucleotide-binding"
[31] "hsa03008:Ribosome biogenesis in eukaryotes"
[32] "GO:0005739~mitochondrion"
[33] "GO:0006268~DNA unwinding involved in DNA replication"
[34] "KW-0131~Cell cycle"
[35] "GO:0000462~maturation of SSU-rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)"
[36] "TRANSIT:Mitochondrion"
[37] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO2); alternate"
[38] "REPEAT:WD 2"
[39] "REPEAT:WD 1"
[40] "REPEAT:WD 3"
[41] "SM00490:HELICc"
[42] "REPEAT:WD 6"
[43] "GO:0071162~CMG complex"
[44] "DOMAIN:Helicase ATP-binding"
[45] "REPEAT:WD 5"
[46] "REPEAT:WD 4"
[47] "GO:0030687~preribosome, large subunit precursor"
[48] "SM00320:WD40"
[49] "DOMAIN:Helicase C-terminal"
[50] "GO:0042555~MCM complex"
[51] "KW-0498~Mitosis"
[52] "GO:0017116~single-stranded DNA-dependent ATP-dependent DNA helicase activity"
[53] "REPEAT:WD 7"
[54] "COMPBIAS:Acidic residues"
[55] "SM00487:DEXDc"
[56] "GO:0003678~DNA helicase activity"
[57] "GO:0003688~DNA replication origin binding"
[58] "SM00350:MCM"
[59] "KW-0658~Purine biosynthesis"
[60] "GO:0006270~DNA replication initiation"
[61] "GO:0042273~ribosomal large subunit biogenesis"
[62] "IPR001650:Helicase, C-terminal"
[63] "IPR012340:Nucleic acid-binding, OB-fold"
[64] "GO:0044208~'de novo' AMP biosynthetic process"
[65] "DOMAIN:MCM"
[66] "IPR001680:WD40 repeat"
[67] "GO:0042645~mitochondrial nucleoid"
[68] "IPR014001:Helicase, superfamily 1/2, ATP-binding domain"
[69] "IPR001208:Mini-chromosome maintenance, DNA-dependent ATPase"
[70] "GO:0003724~RNA helicase activity"
[71] "GO:0030174~regulation of DNA-dependent DNA replication initiation"
[72] "GO:0030515~snoRNA binding"
[73] "GO:0000398~mRNA splicing, via spliceosome"
[74] "IPR018525:Mini-chromosome maintenance, conserved site"
[75] "IPR015943:WD40/YVTN repeat-like-containing domain"
[76] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO1); alternate"
[77] "KW-0489~Methyltransferase"
[78] "GO:0006189~'de novo' IMP biosynthetic process"
[79] "hsa03430:Mismatch repair"
[80] "IPR011545:DNA/RNA helicase, DEAD/DEAH box type, N-terminal"
[81] "KW-0137~Centromere"
[82] "KW-0132~Cell division"
[83] "GO:0005759~mitochondrial matrix"
[84] "GO:0006281~DNA repair"
[85] "COMPBIAS:Basic and acidic residues"
[86] "IPR027417:P-loop containing nucleoside triphosphate hydrolase"
[87] "hsa03013:Nucleocytoplasmic transport"
[88] "GO:0003682~chromatin binding"
[89] "GO:0051301~cell division"
[90] "GO:0097294~'de novo' XMP biosynthetic process"
[91] "GO:0004386~helicase activity"
[92] "GO:0030532~small nuclear ribonucleoprotein complex"
[93] "GO:0000727~double-strand break repair via break-induced replication"
[94] "GO:0006397~mRNA processing"
[95] "KW-0436~Ligase"
[96] "KW-0949~S-adenosyl-L-methionine"
[97] "GO:0006271~DNA strand elongation involved in DNA replication"
[98] "KW-0227~DNA damage"
[99] "KW-0747~Spliceosome"
[100] "h_mcmPathway:CDK Regulation of DNA Replication"
[101] "IPR019775:WD40 repeat, conserved site"
[102] "KW-0507~mRNA processing"
[103] "MOTIF:Arginine finger"
[104] "GO:0005515~protein binding"
[105] "GO:0006177~GMP biosynthetic process"
[106] "GO:0000466~maturation of 5.8S rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)"
[107] "KW-0234~DNA repair"
[108] "GO:0000463~maturation of LSU-rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)"
[109] "KW-0508~mRNA splicing"
[110] "MOTIF:Nuclear localization signal"
[111] "GO:0003697~single-stranded DNA binding"
[112] "KW-0030~Aminoacyl-tRNA synthetase"
[113] "hsa03410:Base excision repair"
[114] "GO:0045943~positive regulation of transcription from RNA polymerase I promoter"
[115] "GO:0000781~chromosome, telomeric region"
[116] "GO:0019899~enzyme binding"
[117] "KW-0819~tRNA processing"
[118] "GO:0005743~mitochondrial inner membrane"
[119] "KW-1135~Mitochondrion nucleoid"
[120] "KW-0995~Kinetochore"
[121] "GO:0006164~purine nucleotide biosynthetic process"
[122] "hsa00670:One carbon pool by folate"
[123] "REPEAT:WD 13"
[124] "GO:0016604~nuclear body"
[125] "MOTIF:Q motif"
[126] "GO:0071013~catalytic step 2 spliceosome"
[127] "GO:0008168~methyltransferase activity"
[128] "SM00360:RRM"
[129] "GO:1990904~ribonucleoprotein complex"
[130] "REPEAT:WD 8"
The significant results for gene set 1 include associations with neurological functions and pathways, for instance signalling at the neurological synapse.
For gene set 2, many significant terms are found (250), which might be due to the large size of this gene set. The most significant terms included for instance Glycoprotein, Developmental Protein, Homeobox/domain and Extracellular Matrix.
The results of gene set 3 include associations with the cell surface and the cell membrane as well as cellular signalling and cell adhesion.
Lastly, the results of gene sets 4 again include more terms then the ones of gene sets 1 and 3, but less then gene set 2. This seems to be surprising when just looking at the sizes of the genes sets, since gene set 4 is with 680 genes by far the smallest gene set. However, as described above, the average expression of genes cluster 4 does not follow a trend similar to what can be observed in the other clusters, but it seems to fluctuate over the course of time. This might be be caused by cluster 4 being more heterogeneous than the other clusters, which could also explain why many more terms were found for this cluster than for clusters 1 and 3, even though they are larger. The significant terms of gene set 4 include functions related to the nucleus, DNA replication, transcription and translation.
When looking at all functions or pathways that DAVID found an association with, it seems like many different cellular components are involved in the changes over the course of time. Especially for the larger clusters, many different gene ontology terms are significant. From this analysis, it is difficult to make assumptions about specific pathways or cellular functions that are up- or down-regulated over the course of time. However, it can be said that the treatment of the cells with differentiation factors possible affects many components of the cells.
GSEA (ranked-based)
# cluster 1
genesets <- msigdbr(species = "Homo sapiens", category = "C8")
genesets_list <- tapply(genesets$ensembl_gene, genesets$gs_name, list)
fgsea_kegg <- fgsea(
pathways = genesets_list,
stats = ranked_list[names_cluster1],
scoreType = "pos",
minSize = 15,
maxSize = 500
)Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.71% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
pathway pval padj
1: MANNO_MIDBRAIN_NEUROTYPES_HNPROG 0.006283566 0.3393126
2: ZHONG_PFC_MAJOR_TYPES_ASTROCYTES 0.006283566 0.3393126
3: MANNO_MIDBRAIN_NEUROTYPES_HPROGM 0.052947053 0.5688857
4: FAN_OVARY_CL3_MATURE_CUMULUS_GRANULOSA_CELL_1 0.047952048 0.5688857
5: HU_FETAL_RETINA_RPE 0.032967033 0.4720280
6: FAN_EMBRYONIC_CTX_BIG_GROUPS_CAJAL_RETZIUS 0.016371489 0.4720280
7: FAN_EMBRYONIC_CTX_ASTROCYTE_2 0.090909091 0.6293706
8: DESCARTES_FETAL_CEREBELLUM_ASTROCYTES 0.094905095 0.6293706
9: MANNO_MIDBRAIN_NEUROTYPES_HOPC 0.028971029 0.4720280
10: DESCARTES_FETAL_INTESTINE_ENS_NEURONS 0.057942058 0.5688857
log2err ES NES size
1: 0.4070179 0.9545631 1.160016 21
2: 0.4070179 0.9339959 1.144380 27
3: 0.1938133 0.9348828 1.130055 15
4: 0.2042948 0.9297943 1.126093 17
5: 0.2489111 0.9151883 1.118347 24
6: 0.3524879 0.8969409 1.107767 39
7: 0.1446305 0.9149529 1.105964 15
8: 0.1412251 0.9075101 1.099104 17
9: 0.2663507 0.8835544 1.091233 39
10: 0.1847065 0.8810065 1.082413 30
# cluster 2
genesets <- msigdbr(species = "Homo sapiens", category = "C8")
genesets_list <- tapply(genesets$ensembl_gene, genesets$gs_name, list)
fgsea_kegg <- fgsea(
pathways = genesets_list,
stats = ranked_list[names_cluster2],
scoreType = "pos",
minSize = 15,
maxSize = 500
)Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.9% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
pathway pval
1: DESCARTES_FETAL_EYE_VASCULAR_ENDOTHELIAL_CELLS 1.805719e-06
2: AIZARANI_LIVER_C28_NK_NKT_CELLS_6 3.115018e-05
3: AIZARANI_LIVER_C3_NK_NKT_CELLS_2 1.938514e-04
4: CUI_DEVELOPING_HEART_LEFT_VENTRICULAR_CARDIOMYOCYTE 3.121505e-04
5: ZHONG_PFC_C4_PTGDS_POS_OPC 1.979419e-03
6: FAN_EMBRYONIC_CTX_BRAIN_NAIVE_LIKE_T_CELL 3.126345e-03
7: TRAVAGLINI_LUNG_VASCULAR_SMOOTH_MUSCLE_CELL 6.804092e-03
8: CUI_DEVELOPING_HEART_C4_ENDOTHELIAL_CELL 9.378443e-06
9: DESCARTES_MAIN_FETAL_RETINAL_PIGMENT_CELLS 1.907497e-02
10: GAUTAM_EYE_IRIS_CILIARY_BODY_ACTIVATED_T_CELLS 3.896104e-02
padj log2err ES NES size
1: 0.0002269187 0.6435518 0.9927855 1.148612 53
2: 0.0019572694 0.5573322 0.9994276 1.147286 20
3: 0.0104402812 0.5188481 0.9979229 1.144721 23
4: 0.0147100928 0.4984931 0.9966341 1.143665 22
5: 0.0788085005 0.4317077 0.9914820 1.137031 21
6: 0.1028687075 0.4317077 0.9900295 1.136086 22
7: 0.1574869991 0.4070179 0.9889374 1.134915 18
8: 0.0008839182 0.5933255 0.9685126 1.126029 82
9: 0.2282658852 0.3524879 0.9779419 1.124117 17
10: 0.2824675325 0.2279872 0.9742387 1.120215 15
# cluster 3
genesets <- msigdbr(species = "Homo sapiens", category = "C8")
genesets_list <- tapply(genesets$ensembl_gene, genesets$gs_name, list)
fgsea_kegg <- fgsea(
pathways = genesets_list,
stats = ranked_list[names_cluster3],
scoreType = "pos",
minSize = 15,
maxSize = 500
)Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.11% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
pathway
1: DESCARTES_FETAL_LUNG_VISCERAL_NEURONS
2: DURANTE_ADULT_OLFACTORY_NEUROEPITHELIUM_MATURE_NEURONS
3: CUI_DEVELOPING_HEART_C4_ENDOTHELIAL_CELL
4: TRAVAGLINI_LUNG_ARTERY_CELL
5: DURANTE_ADULT_OLFACTORY_NEUROEPITHELIUM_OLFACTORY_ENSHEATHING_GLIA
6: DESCARTES_FETAL_MUSCLE_SKELETAL_MUSCLE_CELLS
7: FAN_OVARY_CL12_T_LYMPHOCYTE_NK_CELL_2
8: TRAVAGLINI_LUNG_PERICYTE_CELL
9: FAN_OVARY_CL4_T_LYMPHOCYTE_NK_CELL_1
10: TRAVAGLINI_LUNG_LIPOFIBROBLAST_CELL
pval padj log2err ES NES size
1: 0.0009021464 0.2222787 0.4772708 0.9690884 1.238612 15
2: 0.0016107149 0.2222787 0.4550599 0.9471298 1.217774 20
3: 0.0033483310 0.3080465 0.4317077 0.9201090 1.187868 23
4: 0.0157707171 0.4352718 0.3524879 0.9088680 1.170608 21
5: 0.0047219908 0.3258174 0.4070179 0.9007948 1.166225 27
6: 0.0299700300 0.5908377 0.2616635 0.9027255 1.157119 17
7: 0.0138248394 0.4239617 0.3807304 0.8938626 1.153984 23
8: 0.0249750250 0.5908377 0.2878571 0.8939645 1.149416 20
9: 0.0079960657 0.3953187 0.3807304 0.8833947 1.148474 31
10: 0.0399600400 0.6893107 0.2249661 0.8900726 1.141058 18
# cluster 4
genesets <- msigdbr(species = "Homo sapiens", category = "C8")
genesets_list <- tapply(genesets$ensembl_gene, genesets$gs_name, list)
fgsea_kegg <- fgsea(
pathways = genesets_list,
stats = ranked_list[names_cluster4],
scoreType = "pos",
minSize = 15,
maxSize = 500
)Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.88% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
pathway pval padj
1: RUBENSTEIN_SKELETAL_MUSCLE_SMOOTH_MUSCLE_CELLS 0.01667187 0.5934066
2: TRAVAGLINI_LUNG_TREM2_DENDRITIC_CELL 0.04095904 0.5934066
3: MURARO_PANCREAS_ALPHA_CELL 0.04195804 0.5934066
4: MANNO_MIDBRAIN_NEUROTYPES_HENDO 0.04495504 0.5934066
5: LAKE_ADULT_KIDNEY_C12_THICK_ASCENDING_LIMB 0.06993007 0.6360698
6: MANNO_MIDBRAIN_NEUROTYPES_HNPROG 0.09090909 0.6360698
7: BUSSLINGER_GASTRIC_IMMUNE_CELLS 0.02597403 0.5934066
8: MANNO_MIDBRAIN_NEUROTYPES_HNBGABA 0.08991009 0.6360698
9: MANNO_MIDBRAIN_NEUROTYPES_HDA1 0.11788212 0.6360698
10: FAN_OVARY_CL10_PUTATIVE_EARLY_ATRESIA_GRANULOSA_CELL 0.15084915 0.6360698
log2err ES NES size
1: 0.3524879 0.8347887 1.235206 17
2: 0.2220560 0.8057483 1.192236 17
3: 0.2192503 0.8007492 1.186913 18
4: 0.2114002 0.7904029 1.174892 19
5: 0.1669338 0.7772472 1.155337 19
6: 0.1446305 0.7796743 1.151660 16
7: 0.2820134 0.7458914 1.147234 45
8: 0.1455161 0.7738129 1.146986 18
9: 0.1250334 0.7675451 1.135708 17
10: 0.1083943 0.7659914 1.127887 15
Discussion
Differntial Gene Expression Analysis
The aim of the differential gene expression analysis was to identify genes that are differentially expressed over the course of time, and to compare these results of the Illumina+Salmon data (short read data) with the results of the PacBio+Bambu data (long read data) for 4 different NDR cutoffs.
When comparing the different NDR cutoffs, only minor differences could be detected. This was expected since the main difference between the data sets of the different cutoffs were the number of newly detected genes. What was also expected about the differential gene expression analysis was the quite large number of genes that were called differentially expressed. For all NDR cutoffs, the fraction of genes called differentially expressed out of all analyzed genes was around 30%. This was not unexpected since the RNA samples came from induced pluripotent stem cells that were treated with differentiation factors.
The more interesting part was the comparison between the short read data and the long read data. The short read data was produced from Illumina sequencing. Since Illumina sequencing has been in use for some time and therefore underwent lots of optimization, it is was treated as a ground truth in this project. The long read data, however, was produced from PacBio sequencing, which suffers still from some biases, like a fragment length bias. Here, the data was produced using a new protocol which might display an optimization for PacBio sequencing. To evaluate this, the results of the PacBio data was compared to the Illumina data results. It could be shown that for both cases roughly the same amount of genes were identified as differentially expressed. However, when overlaying these genes, there were still many genes only called differentially expressed in one of the two cases. Moreover, when comparing the adjusted p-values and logFC values of the likelihood ratio tests, deviations between the two cases were visible. Nevertheless, a common trend could be detected, also for the differential gene expression when comparing only days 0 and 5. Together with the fact that the majority of differential expressed genes tend to be called by both methods, this shows that the analysis of the PacBio data already comes close to the truth, when treating Illumina as the truth. Due to the lack of PacBio data from older protocols, it is not possible to draw a conclusion regarding the improvement of this protocol compared to others.
Grouping of Differentially Expressed Genes
The grouping of differentially expressed genes was only an intermediate step between the differential expression analysis and the gene set analysis. Nonetheless, it showed some interesting results and can possible influence the outcome of the gene set analysis.
The number of clusters/sets the genes were grouped in (here 4) was choosen more or less arbitrarily, mostly guided by visual inspection. The since the number of clusters was rather small, and the number of genes that were grouped was rather large, most of the clusters were rather large as well, they mostly contained more than 1000 genes. Since the clusters were used as input gene sets for the gene set analysis, this choice possible affected the outcome of the gene set analysis and will therefore be discussed again in the context of the gene set analysis.
Gene Set Analysis
The gene set analysis was performed with the aim of making sense of the genes that the analysis identified as differentially expressed. To do so, the detected differentially expressed genes were grouped into 4 gene sets, which were then used as input for the gene set analysis with the web tool DAVID.
For the gene set 1 and 3, around 30 to 50 gene ontology terms were found to be significantly enriched in the gene sets of about 1200 - 1500 genes. The terms for gene set 1 were mostly related to the neurological synapse and signalling between neurons, meaning that this cluster possible contains an enriched amount of genes involved in neurological signalling. For gene set 3, the terms were mostly related to the cell membrane, the cytoplasm and signal transduction. The terms found significant for these two gene sets were rather homogeneous. The same could be observed for gene set 4, which genes were found to be involved mostly in functions associated with the cell nucleus and RNA/DNA. Notably, the amount of terms that were detected to be significantly enriched in gene set 4 was much higher (130 terms) than in gene sets 1 and 3, even though gene set 4 was only around half the size of the other two (there are 680 genes in gene set 4). Assuming that more genes in a gene set usually lead to more terms to be found significant, this could mean that gene set 4 is less homogeneous than gene sets 1 and 3. The largest gene set, gene set 4, shows the highest amount of significant terms (250 while containing 2158 genes). The results of this gene sets contain many terms associated with the Homeobox/Homeodomain and other developmental functions, but also with many other things like cell membrane/surface, signalling or the immune system. These results seem to be more heterogeneous than the results of the other 3 gene sets, which could be caused by the larger size of gene set 2 and/or by more heterogeneity between the genes of this gene set. To overcome this issue and get clearer results also for gene set 2, the number of clusters during the grouping of differentially expressed genes could have been choosen larger to create smaller and therefore possible more homogeneous gene sets. However, when cutting the tree with the cutree function, the largest cluster, so gene set 2, gets only notable smaller when setting the number of cluster to 13. At this point, many small clusters are create, with less than 50 genes. Therefore, the number of clusters was left to be 4.
In general, the terms that were found to be significant by the gene set analysis for all gene sets cover a lot of different functions and pathways in the cell.
Outlook
Literature
- Benjamini, Y. and Hochberg, Y. (1995), Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57: 289-300.
- Gehlenborg N (2019). UpSetR: A More Scalable Alternative to Venn and Euler Diagrams for Visualizing Intersecting Sets. https://CRAN.R-project.org/package=UpSetR.
- Posit team (2023). RStudio: Integrated Development Environment for R. Posit Software, PBC, Boston, MA. URL http://www.posit.co/.
- Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140.
- Sherman, B. T., Hao, M., Qiu, J., Jiao, X., Baseler, M. W., Lane, H. C., Imamichi, T., & Chang, W. (2022). DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic acids research 50, 216-221.
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Zurich
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] msigdbr_7.5.1 fgsea_1.26.0 gplots_3.1.3 pheatmap_1.0.12
[5] viridis_0.6.4 viridisLite_0.4.2 UpSetR_1.4.0 edgeR_3.42.4
[9] limma_3.56.2 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[13] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[17] tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] fastmatch_1.1-4 babelgene_22.9 gtable_0.3.4
[4] xfun_0.41 htmlwidgets_1.6.3 caTools_1.18.2
[7] lattice_0.22-5 tzdb_0.4.0 vctrs_0.6.4
[10] tools_4.3.1 bitops_1.0-7 generics_0.1.3
[13] parallel_4.3.1 fansi_1.0.5 pkgconfig_2.0.3
[16] Matrix_1.6-4 KernSmooth_2.23-22 data.table_1.14.8
[19] RColorBrewer_1.1-3 lifecycle_1.0.4 farver_2.1.1
[22] compiler_4.3.1 munsell_0.5.0 codetools_0.2-19
[25] htmltools_0.5.7 yaml_2.3.7 pillar_1.9.0
[28] BiocParallel_1.34.2 nlme_3.1-164 gtools_3.9.5
[31] tidyselect_1.2.0 locfit_1.5-9.8 digest_0.6.33
[34] stringi_1.8.2 labeling_0.4.3 splines_4.3.1
[37] cowplot_1.1.1 fastmap_1.1.1 grid_4.3.1
[40] colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
[43] utf8_1.2.4 withr_2.5.2 scales_1.3.0
[46] timechange_0.2.0 rmarkdown_2.25 gridExtra_2.3
[49] hms_1.1.3 evaluate_0.23 knitr_1.45
[52] mgcv_1.9-0 rlang_1.1.2 Rcpp_1.0.11
[55] glue_1.6.2 rstudioapi_0.15.0 jsonlite_1.8.7
[58] R6_2.5.1 plyr_1.8.9